
Load results

load("./processed_data_files/what_we_find_VS_ELM_clust20171019.RData")
rm(list = ls()[!ls() %in% c("printTable","XYZ.p.adjust","res_count", "resJustFISHER", "resPmult", "resPmultInv", "sequential_filter")])
doman_viral_pairs = TRUE # if FALSE, add human proteins containing domains
motifs = FALSE # based on Vidal's data

Empirical p-value for seeing a domain a number of times

What is the chance of randomly seeing any domain the observed number of times among all proteins that interact with a specific viral protein?

printTable(res_count, doman_viral_pairs = doman_viral_pairs, motifs = motifs, destfile = "./results/domains_fdr_corrected_empirical_p_value.tsv")
plot(res_count)
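
The question above can be illustrated with a small permutation sketch. This is not the code that produced `res_count`; it is a minimal example under assumed toy numbers, and every name in it (`has_domain`, `n_interactors`, `observed_count`, `n_perm`) is hypothetical. It draws random sets of interactors and asks how often a domain is seen at least as many times as observed.

```r
# Toy data (all values hypothetical): 1000 human proteins, 50 of which
# contain the domain of interest.
set.seed(1)
has_domain <- rep(c(TRUE, FALSE), times = c(50, 950))

n_interactors  <- 40  # proteins that interact with one viral protein
observed_count <- 8   # how many of those interactors contain the domain

# Null distribution: sample the same number of interactors at random
# and count how many of them contain the domain.
n_perm <- 1000
null_counts <- replicate(n_perm, sum(sample(has_domain, n_interactors)))

# Empirical p-value, with +1 in numerator and denominator so that the
# estimate is never exactly zero.
emp_p <- (sum(null_counts >= observed_count) + 1) / (n_perm + 1)
emp_p
```

The `+ 1` correction is a common convention for permutation tests; whether the original analysis uses it is an assumption.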

Fisher test for co-occurrence of binding viral protein and containing a domain

printTable(resJustFISHER, doman_viral_pairs = doman_viral_pairs, motifs = motifs, destfile = "./results/domains_fdr_corrected_fisher_test.tsv")
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html
plot(resJustFISHER, IDs_interactor_viral + IDs_domain_human ~ p.value, xlab = "Fisher's Exact Test pvalue", breaks = seq(-0.01, 1.01, 0.01))
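
For a single viral protein and human domain pair, the co-occurrence test amounts to a 2x2 contingency table. The sketch below uses base R `fisher.test` on made-up counts; the counts, the table layout, and the one-sided alternative are assumptions for illustration and may differ from how `resJustFISHER` was computed.

```r
# Hypothetical counts of human proteins for one viral protein / domain pair.
n_both        <- 8    # bind the viral protein and contain the domain
n_bind_only   <- 32   # bind the viral protein, do not contain the domain
n_domain_only <- 42   # contain the domain, do not bind the viral protein
n_neither     <- 918  # neither bind nor contain the domain

# Rows: binds the viral protein (yes/no); columns: has the domain (yes/no).
tab <- matrix(c(n_both, n_bind_only,
                n_domain_only, n_neither),
              nrow = 2, byrow = TRUE,
              dimnames = list(binds_viral = c("yes", "no"),
                              has_domain  = c("yes", "no")))

# One-sided test: is the domain over-represented among the interactors?
fisher_res <- fisher.test(tab, alternative = "greater")
fisher_res$p.value
```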

Combining the p-value for seeing a domain a number of times and the p-value for co-occurrence of binding viral protein and containing a domain

Multiplying p-values

printTable(resPmult, doman_viral_pairs = doman_viral_pairs, motifs = motifs, destfile = "./results/domains_fdr_corrected_empirical_p_value_X_fisher_test.tsv")
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html
plot(resPmult, IDs_interactor_viral + IDs_domain_human ~ p.value, xlab = "Fisher's Exact Test pvalue * \nempirical P value for observing domain in N proteins", breaks = seq(-0.01, 1.01, 0.01))
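
A minimal sketch of the multiplication step, on a made-up table. The column names `p.value` and `Emp.p.value` follow the ones used elsewhere in this report; the `pair` labels, the values, and the `p.product` column are hypothetical.

```r
# Hypothetical viral protein / human domain pairs with both p-values.
pairs <- data.frame(
  pair        = c("A", "B", "C", "D"),
  Emp.p.value = c(0.01, 0.20, 1.00, 0.05),  # empirical p-value (domain count)
  p.value     = c(0.02, 0.01, 0.03, 0.50),  # Fisher test p-value
  stringsAsFactors = FALSE
)

# Combined score: the plain product of the two p-values.
pairs$p.product <- pairs$Emp.p.value * pairs$p.value

# Rank pairs from the smallest product to the largest.
pairs <- pairs[order(pairs$p.product), ]
pairs
```

The product of two p-values is not uniformly distributed under the null, so it is safer to read it as a ranking score than as a p-value. In this toy table, pair C keeps a small product even though its empirical p-value is 1.0.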

Multiplying the inverse of p-values

The idea is that low p-values mean a higher chance of detecting a signal. I am not sure this is statistically correct, but it removes domains with p = 1.0: the inverse of an empirical p-value of 1.0 for the frequency is 0, so the Fisher term is multiplied by 0.

printTable(resPmultInv, doman_viral_pairs = doman_viral_pairs, motifs = motifs, destfile = "./results/domains_fdr_corrected_inv_empirical_p_value_X_inv_fisher_test.tsv")
plot(resPmultInv, IDs_interactor_viral + IDs_domain_human ~ p.value,
     xlab = "Inverse of Fisher's Exact Test pvalue * \ninverse of empirical P value for observing domain in N proteins",
     breaks = seq(-0.01, 1.01, 0.01))
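
The effect described above can be shown on a toy table. This sketch assumes that "inverse" means `1 - p`, which is consistent with p = 1.0 mapping to 0 but is not confirmed by the code shown here; the `pair` labels, the values, and the `inv.product` column are hypothetical.

```r
# Hypothetical pairs; B has an empirical p-value of exactly 1.0.
pairs <- data.frame(
  pair        = c("A", "B", "C"),
  Emp.p.value = c(0.01, 1.00, 0.20),   # empirical p-value (domain count)
  p.value     = c(0.02, 0.001, 0.30),  # Fisher test p-value
  stringsAsFactors = FALSE
)

# Assumed definition of the inverse: 1 - p, so larger means stronger signal.
pairs$inv.product <- (1 - pairs$Emp.p.value) * (1 - pairs$p.value)

# Rank from the largest score to the smallest; a pair with either
# p-value equal to 1.0 gets a score of 0 and drops to the bottom.
pairs <- pairs[order(pairs$inv.product, decreasing = TRUE), ]
pairs
```

Pair B has the smallest Fisher p-value in this table, yet its score is 0 because its empirical p-value is 1.0.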

2-step filtering, ranking by Fisher test p-value

printTable(sequential_filter, doman_viral_pairs = doman_viral_pairs, motifs = motifs, destfile = "./results/domains_fdr_corrected_sequential_filter.tsv")
plot(sequential_filter, IDs_interactor_viral + IDs_domain_human ~ p.value,
     xlab = "Fisher's Exact Test pvalue",
     breaks = seq(-0.01, 1.01, 0.01))
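
One way to read the two-step approach is: first keep only pairs that pass a cutoff on the empirical p-value, then rank the survivors by the Fisher test p-value. The sketch below follows that reading on made-up data; the order of the two steps, the cutoff `emp_cutoff`, and the `pair` labels are assumptions, not taken from the code that built `sequential_filter`.

```r
# Hypothetical pairs with both p-values.
pairs <- data.frame(
  pair        = c("A", "B", "C", "D"),
  Emp.p.value = c(0.01, 0.20, 0.04, 1.00),    # empirical p-value (domain count)
  p.value     = c(0.02, 0.001, 0.005, 0.03),  # Fisher test p-value
  stringsAsFactors = FALSE
)

# Step 1: filter on the empirical p-value (cutoff is an assumption).
emp_cutoff <- 0.05
kept <- pairs[pairs$Emp.p.value <= emp_cutoff, ]

# Step 2: rank the remaining pairs by the Fisher test p-value.
kept <- kept[order(kept$p.value), ]
kept
```

Unlike the product score, this keeps the two p-values separate, so a strong Fisher result cannot compensate for a domain count that is unremarkable.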

PermutResult2D(res = sequential_filter, N = 500, value.cols = c("p.value", "Emp.p.value")) +
    ggtitle("2D-bin plots of 250 top-scoring viral protein - human domain pairs, \n statistic: count of a domain among interacting partners of a viral protein")
## Warning in plyr::split_indices(scale_id, n): '.Random.seed' is not an
## integer vector but of type 'NULL', so ignored
## Warning: Removed 242 rows containing non-finite values (stat_density).
## Warning in (function (data, mapping, alignPercent = 0.6, method =
## "pearson", : Removed 242 rows containing missing values
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 278 rows containing non-finite values (stat_bin2d).
## Warning: Removed 242 rows containing non-finite values (stat_density).

How do all these methods perform at finding ELM domains?

The absolute numbers

The total number of domains found, compared with the total number in ELM.

The enrichment in ELM domains over the background

P-value for the enrichment in ELM domains over the background
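
As an illustration of the two quantities named above, the sketch below computes a fold enrichment and a hypergeometric p-value from made-up counts. None of the numbers come from this analysis, and the choice of a hypergeometric test (via base R `phyper`) is an assumption about how such an enrichment could be tested.

```r
# Hypothetical counts (not results from this analysis).
n_background <- 500  # all domains tested (the background)
n_elm        <- 40   # background domains that are listed in ELM
n_top        <- 50   # top-scoring domains returned by one method
n_top_elm    <- 12   # top-scoring domains that are listed in ELM

# Fold enrichment: ELM fraction among top domains over ELM fraction
# in the background.
enrichment <- (n_top_elm / n_top) / (n_elm / n_background)

# P(X >= n_top_elm) when n_top domains are drawn at random from the
# background without replacement.
p_enrich <- phyper(n_top_elm - 1,
                   m = n_elm,                 # ELM domains in background
                   n = n_background - n_elm,  # non-ELM domains in background
                   k = n_top,                 # number of domains drawn
                   lower.tail = FALSE)

c(enrichment = enrichment, p.value = p_enrich)
```

The same p-value can be obtained from a one-sided `fisher.test` on the corresponding 2x2 table, which may be more convenient if the Fisher machinery is already in place.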